Learning Information Extraction Rules for Web Data Mining

نویسندگان

  • Chia-Hui Chang
  • Chun-Nan Hsu
چکیده

The explosive growth and popularity of the World Wide Web has resulted in a huge number of information sources on the Internet. However, due to the heterogeneity and the lack of structure of Web information sources, access to this huge collection of information has been limited to browsing and keyword searching. Sophisticated Webmining applications, such as comparison shopping, require expensive maintenance costs to deal with different data formats. The problem in translating the contents of input documents into structured data is called information extraction (IE). Unlike information retrieval (IR), which concerns how to identify relevant documents from a document collection, IE produces structured data ready for post-processing, which is crucial to many applications of Web mining and search tools. Formally, an information extraction task is defined by its input and its extraction target. The input can be unstructured documents like free text that are written in natural language or semi-structured documents that are pervasive on the Web, such as tables, itemized and enumerated lists, and so forth. The extraction target of an IE task can be a relation of k-tuple (where k is the number of attributes in a record), or it can be a complex object with hierarchically organized data. For some IE tasks, an attribute may have zero (missing) or multiple instantiations in a record. The difficulty of an IE task can be complicated further when various permutations of attributes or typographical errors occur in the input documents. Programs that perform the task of information extraction are referred to as extractors or wrappers. A wrapper is originally defined as a component in an information integration system that aims at providing a single uniform query interface to access multiple information sources. In an information integration system, a wrapper is generally a program that wraps an information source (e.g., a database server or a Web server) such that the information integration system can access that information source without changing its core query answering mechanism. In the case where the information source is a Web server, a wrapper must perform information extraction in order to extract the contents in HTML documents. Wrapper induction (WI) systems are software tools that are designed to generate wrappers. A wrapper usually performs a pattern-matching procedure (e.g., a form of finite-state machines), which relies on a set of extraction rules. Tailoring a WI system to a new requirement is a task that varies in scale, depending on the text type, domain, and scenario. To maximize reusability and minimize maintenance cost, designing a trainable WI system has been an important topic in research fields, including message understanding, machine learning, pattern mining, and so forth. The task of Web IE differs largely from traditional IE tasks in that traditional IE aims at extracting data from totally unstructured free texts that are written in natural language. In contrast, Web IE processes online documents that are semi-structured and usually generated automatically by a server-side application program. As a result, traditional IE usually take advantage of natural language processing techniques such as lexicons and grammars, while Web IE usually applies machine learning and pattern-mining techniques to exploit the syntactical patterns or to lay out structures of the template-based documents.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Optimizing Membership Functions using Learning Automata for Fuzzy Association Rule Mining

The Transactions in web data often consist of quantitative data, suggesting that fuzzy set theory can be used to represent such data. The time spent by users on each web page is one type of web data, was regarded as a trapezoidal membership function (TMF) and can be used to evaluate user browsing behavior. The quality of mining fuzzy association rules depends on membership functions and since t...

متن کامل

Sequential Pattern Mining for Web Extraction Rule Generalization

Information extraction (IE) is an important problem for information integration with broad applications. It is an attractive application for machine learning. The core of this problem is to learn extraction rules from given input. This paper extends a pattern discovery approach called IEPAD to the rapid generation of information extractors that can extract structured data from semi-structuredWe...

متن کامل

Using a Data Mining Tool and FP-Growth Algorithm Application for Extraction of the Rules in two Different Dataset (TECHNICAL NOTE)

In this paper, we want to improve association rules in order to be used in recommenders. Recommender systems present a method to create the personalized offers. One of the most important types of recommender systems is the collaborative filtering that deals with data mining in user information and offering them the appropriate item. Among the data mining methods, finding frequent item sets and ...

متن کامل

Text Mining with Information Extraction

The popularity of the Web and the large number of documents available in electronic form has motivated the search for hidden knowledge in text collections. Consequently, there is growing research interest in the general topic of text mining. In this paper, we develop a text-mining system by integrating methods from Information Extraction (IE) and Data Mining (Knowledge Discovery from Databases ...

متن کامل

Development of a Combined System Based on Data Mining and Semantic Web for the Diagnosis of Autism

Introduction: Autism is a nervous system disorder, and since there is no direct diagnosis for it, data mining can help diagnose the disease. Ontology as a backbone of the semantic web, a knowledge database with shareability and reusability, can be a confirmation of the correctness of disease diagnosis systems. This study aimed to provide a system for diagnosing autistic children with a combinat...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2015